
Conversation

@damian0815 commented Oct 17, 2025

Port cc-mrjob/sitemaps_from_robotstxt.py.

  • Basic functionality unit tests
  • warcio implementation
    • Validate output is identical to MRJob output with "test" robotstxt in MRJob repo
    • Validate on recent full-scale crawl output
  • fastwarc implementation
  • unit test to validate text encoding edge cases and validity (currently all test cases are completely valid utf8)
  • check output works with crawl-tools/server/seed/sitemaps/sitemaps_robotstxt.py
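
For context, the heart of the ported job is scanning each robots.txt capture for Sitemap: directives and resolving them against the robots.txt URL. A minimal sketch of that step, not the PR's actual implementation (the function name and return shape are illustrative only):

from urllib.parse import urljoin

def extract_sitemap_urls(robots_txt_url, robots_txt_body):
    """Yield absolute sitemap URLs announced in a robots.txt body."""
    for line in robots_txt_body.splitlines():
        # a directive looks like "Sitemap: <URL>", matched case-insensitively
        name, _, value = line.partition(':')
        if name.strip().lower() == 'sitemap':
            # relative sitemap URLs are resolved against the robots.txt location
            yield urljoin(robots_txt_url, value.strip())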

@damian0815 marked this pull request as draft October 17, 2025 15:27
@damian0815 marked this pull request as ready for review October 20, 2025 14:24
Signed-off-by: Damian Stewart <[email protected]>

@sebastian-nagel left a comment


First pass. I'll continue with testing, but several points need discussion.

Signed-off-by: Damian Stewart <[email protected]>

@sebastian-nagel left a comment


Great! A few minor things remain to be done.

Tested on my local machine:

  • successfully ran the unit tests using both the pyspark module and an installed Spark. For the latter, it's required to set PYTHONPATH=$PWD/test:$(ls $SPARK_HOME/python/lib/py4j-*-src.zip):$SPARK_HOME/python:$PYTHONPATH
  • successfully tested sitemaps_from_robotstxt.py. Output looks good. On a small test sample, there are no differences in the number of extracted sitemap URLs compared with the cc-mrjob implementation.
  • failed to run sitemaps_from_robotstxt_fastwarc.py

requirements.txt Outdated
orjson
warcio

# for validating URLs in robots.txt:
Contributor


Not required anymore.

Contributor Author


Fixed

from typing import Optional
from urllib.parse import urlparse, urljoin

import validators
Contributor


Not required anymore.

Contributor Author


Fixed

def test_host_accumulation_same_host(spark):
    """
    Test accumulation of hosts when sitemap url host and robots.txt url host match
    Requires test/ on PYTHONPATH so utils._process_jobs can be imported
Contributor


A 4-line function could be inlined to avoid the import error.

However, test/ is also required on the path to load test_sitemaps_from_robotstxt, which is needed for Spark serialization. So you need to run it like

PYTHONPATH=$PYTHONPATH:./test python -m pytest test -v

Contributor Author


README has been updated with this information

robots_txt_with_more_than_50_sitemaps = None


def init_accumulators(self, session):
Contributor


log_accumulators also needs to be overridden, otherwise the class-specific accumulators are never shown and are not preserved once the job has finished. See cc_index_word_count.py or wat_extract_links.py.
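
For reference, a rough sketch of the override pattern used by other cc-pyspark jobs such as wat_extract_links.py. The counters mirror those in the job's log output; apart from robots_txt_with_more_than_50_sitemaps, the attribute names and method bodies here are assumptions, not the PR's code:

def init_accumulators(self, session):
    super().init_accumulators(session)
    sc = session.sparkContext
    self.robots_txt_processed = sc.accumulator(0)
    self.sitemap_urls_found = sc.accumulator(0)
    self.robots_txt_with_more_than_50_sitemaps = sc.accumulator(0)

def log_accumulators(self, session):
    super().log_accumulators(session)
    # without this override, only the base class's generic counters are logged
    self.log_accumulator(session, self.robots_txt_processed,
                         'robots.txt successfully parsed = {}')
    self.log_accumulator(session, self.sitemap_urls_found,
                         'sitemap urls found = {}')
    self.log_accumulator(session, self.robots_txt_with_more_than_50_sitemaps,
                         'robots.txt with more than 50 sitemaps = {}')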

Contributor Author


done

# process only WARC response records
fastwarc_record_filter = WarcRecordType.response

# process_record is implemented by SitemapExtractorJob
Contributor


The "main" block is required in order to run the job.
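
A sketch of the usual cc-pyspark entry point, assuming the FastWARC job class is named SitemapExtractorFastWarcJob:

if __name__ == '__main__':
    # instantiate the job and hand control to the shared job runner
    job = SitemapExtractorFastWarcJob()
    job.run()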

Contributor


Please also run the job to verify that there are no errors.

Contributor Author


fixed


if robots_txt_url is None:
    # first sitemap found: set base URL and get host from URL
    robots_txt_url = record.rec_headers['WARC-Target-URI']
Contributor


This is not compatible with FastWARC; it should be self.get_warc_header(record, 'WARC-Target-URI').
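
The helper hides the difference between the two WARC libraries. A rough sketch of how the base classes might implement it (class names as in cc-pyspark's sparkcc.py; the exact method bodies are assumptions):

# warcio-based jobs (CCSparkJob): headers live on record.rec_headers
def get_warc_header(self, record, header, default=None):
    return record.rec_headers.get_header(header, default)

# FastWARC-based jobs (CCFastWarcSparkJob): headers live on record.headers
def get_warc_header(self, record, header, default=None):
    return record.headers.get(header, default)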

Contributor Author


fixed


@sebastian-nagel left a comment


Excellent, @damian0815!

Unit tests pass; I successfully ran both versions (warcio and fastwarc) locally.

I'll also run the job on a real cluster later today and merge the PR if this test passes as well.

Would you mind squashing the commits down to a small and meaningful number? I can also do it when merging. Thanks!

@damian0815

I think it's easier/cleaner if you squash when merging?


@sebastian-nagel left a comment


I've successfully run sitemaps_from_robotstxt_fastwarc.py on 5% of the robots.txt captures of CC-MAIN-2025-43 using a single-node Hadoop cluster (Spark on YARN):

13:57:14.391 [Thread-5] INFO  SitemapExtractorFastWarc - WARC/WAT/WET input files processed = 5000
13:57:14.394 [Thread-5] INFO  SitemapExtractorFastWarc - WARC/WAT/WET input files failed = 0
13:57:14.396 [Thread-5] INFO  SitemapExtractorFastWarc - WARC/WAT/WET records processed = 4444146
13:57:14.398 [Thread-5] INFO  SitemapExtractorFastWarc - robots.txt successfully parsed = 4444146
13:57:14.401 [Thread-5] INFO  SitemapExtractorFastWarc - sitemap urls found = 3949561
13:57:14.403 [Thread-5] INFO  SitemapExtractorFastWarc - sitemap urls with invalid utf-8 encoding = 76
13:57:14.405 [Thread-5] INFO  SitemapExtractorFastWarc - robots.txt announcing at least 1 sitemap = 1885795
13:57:14.408 [Thread-5] INFO  SitemapExtractorFastWarc - robots.txt with more than 50 sitemaps = 3547

I'll merge the code. Thanks, @damian0815!

@sebastian-nagel merged commit cc70f85 into main Nov 1, 2025
4 checks passed